ABSTRACT

Background and objective Integrating data from 
multiple sources is a crucial and challenging problem. 
Even though there exist numerous algorithms for record
linkage or deduplication, they suffer from either long
run times or restrictions on the number of datasets that
they can integrate. In this paper we report efficient
sequential and parallel algorithms for record linkage 
which handle any number of datasets and outperform 
previous algorithms. 

Methods Our algorithms employ hierarchical clustering 
algorithms as the basis. A key idea that we use is radix 
sorting on certain attributes to eliminate identical records 
before any further processing. Another novel idea is to 
form a graph that links similar records and find the 
connected components. 

Results Our sequential and parallel algorithms have 
been tested on a real dataset of 1 083 878 records and 
synthetic datasets ranging in size from 50 000 to 
9 000 000 records. Our sequential algorithm runs at 
least two times faster, for any dataset, than the previous 
best-known algorithm, the two-phase algorithm using 
faster computation of the edit distance (TPA (FCED)). 
The speedups obtained by our parallel algorithm are 
almost linear. For example, we get a speedup of 7.5 
with 8 cores (residing in a single node), 14.1 with 16 
cores (residing in two nodes), and 26.4 with 32 cores 
(residing in four nodes). 

Conclusions We have compared the performance of
our sequential algorithm with TPA (FCED) and found
that our algorithm outperforms the previous one. The
accuracy is the same as that of this previous best-known
algorithm.

INTRODUCTION

Identifying duplicates in voluminous datasets is a crucial problem in
many areas of science and engineering. This is especially true for
medical records of individuals from different health agencies.
Integration of medical records provides a great opportunity to analyze
and evaluate disease evolution. 1 2 Methods 3 exist for linking records
across multiple medical data centers to identify disease origin and
diversity. 4 Copy detection in digital documents also employs data
integration techniques to detect similarities. 5 6 Data integration
techniques integrate records across different data sources, usually in
the absence of any global identifier. This is a way to identify
individuals who have records in different datasets. If all the records
pertaining to the same individual are exactly correct, the problem of
identifying duplicates will be straightforward to solve. Unfortunately,
records of the same person might look different owing to errors
introduced by typing, phonetic similarity, etc. As a result, the record
linkage problem is very challenging. Existing algorithms take a very
long time, especially when the data size is large. Thus, it is still an
important open problem to discover faster algorithms. In this paper we
propose a sequential algorithm that is up to two orders of magnitude
faster than one of the prior algorithms, the two-phase algorithm using
faster computation of the edit distance (TPA (FCED)). 7 We also present
a parallel algorithm that achieves a nearly linear speedup.

Numerous approaches to record linkage have been developed in the
literature. Most of these algorithms link two datasets at a time. In
practice, we often have many more than two datasets. If we have two
datasets A and B and if $n_a$ and $n_b$ are the numbers of records in
them, respectively, then in the worst case we have to process
$n_a \times n_b$ record pairs. 8 Some learning algorithms generate
comparison vectors and classify them, 9 and generating the vectors
takes a large amount of time.

As the basis for our algorithms, we have used hierarchical
clustering, 10 11 which is also widely applied in information
theory, 12 gene expression, 13-16 data mining, 17 18 health
psychology, 19 and many other fields to identify distributions of
corresponding objects or data. Our algorithms use the single linkage
method to calculate distances. To reduce the cost of calculating
linkages, we first apply radix sort to the records. 20 Our algorithms
also consider different types of errors including typing errors,
reversal of the first name and the last name, use of nicknames,
truncation of attributes, etc. 7 We have thoroughly tested our
algorithms on a large number of synthetic and real datasets. These
tests show that the proposed algorithms outperform previous algorithms
in terms of time and space. The parallel algorithm achieves a nearly
linear speedup.

BACKGROUND AND SIGNIFICANCE 

Record linkage among multiple datasets typically involves millions of
records and hundreds of thousands of individuals. The problem of record
linkage can be thought of as one of clustering the records such that
each cluster has records pertaining to one and only one individual. 21
Clustering, in general, is the process of partitioning objects so that
similar objects are grouped into the same group (ie, cluster). A number
of clustering methods can be found in the literature, including
hierarchical clustering, graph-based clustering, statistical
clustering, and centroid based clustering. Any clustering method
employs a metric (known as linkage) for defining the distance between
two clusters. Distance between two clusters indicates how similar these
two clusters are. In complete linkage, distance between two clusters
(of records) A and B is defined as the maximum distance between a
record in A and a record in B, while single linkage uses the minimum
distance. The distance between two given records can also be defined in
a number of ways. Examples include the Levenshtein distance (also known
as the edit distance) and the Hamming distance.
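
To make these definitions concrete, the following minimal Python sketch computes the two record distances and the two linkages named above. The function names are illustrative choices of ours, and the edit distance is the textbook dynamic program rather than the faster variant used later in the paper.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic program (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def hamming(a: str, b: str) -> int:
    """Number of differing positions; defined only for equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def single_linkage(A, B, dist=levenshtein):
    """Minimum distance between a record in A and a record in B."""
    return min(dist(x, y) for x, y in product(A, B))


def complete_linkage(A, B, dist=levenshtein):
    """Maximum distance between a record in A and a record in B."""
    return max(dist(x, y) for x, y in product(A, B))
```

For the clusters A = ["aa"] and B = ["ab", "bb"], single linkage gives 1 and complete linkage gives 2, since the closest and farthest pairs differ in one and two characters, respectively.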

Hierarchical clustering can be done in two different ways: (1) 
The agglomerative approach (bottom-up) starts with n clusters 
(where n is the number of records or points to be clustered), 
where each cluster has a single point. From there on clustering 
happens in iterations where in each iteration the two closest 
clusters are merged into one. Iterations stop when we have only 
a single cluster containing all the n points. The sequence of 
merging steps done in the algorithm can be represented as a tree 
called a dendrogram. If we have a target number of clusters in 
mind, we can cut the dendrogram at an appropriate level. The 
dendrogram can also be cut using a cluster threshold distance. 
(2) The divisive clustering approach (top-down) starts with a 
single cluster having all the n points. This cluster is then split 
hierarchically until we end up with n clusters, each cluster 
having a single point. 
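
The agglomerative procedure with a threshold cut can be sketched as follows. This is a deliberately naive illustration (it rescans all cluster pairs in every iteration), not the algorithm proposed in this paper; the names `agglomerative_single_linkage`, `dist`, and `threshold` are hypothetical.

```python
def agglomerative_single_linkage(points, dist, threshold):
    """Bottom-up clustering: repeatedly merge the two closest clusters and
    stop once the closest pair is farther apart than `threshold`."""
    clusters = [[p] for p in points]        # start with n singleton clusters
    while len(clusters) > 1:
        best = None                         # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: minimum distance over all cross pairs
                d = min(dist(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:                   # cutting the dendrogram at this level
            break
        clusters[i].extend(clusters[j])     # merge cluster j into cluster i
        del clusters[j]
    return clusters


if __name__ == "__main__":
    pts = [1, 2, 3, 10, 11, 20]
    print(agglomerative_single_linkage(pts, lambda x, y: abs(x - y), 1))
```

With a threshold large enough to cover every gap, the loop runs until a single cluster remains, which corresponds to building the full dendrogram.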

In this paper we employ agglomerative hierarchical clustering, using
single linkage. We treat each record as a string of characters and
define the distance between two records based on edit distance.
Different kinds of common errors have been taken into account: reversal
of first and last names, truncation of attributes, etc. 7

METHODS 
Previous methods 

A simple brute force approach for record linkage is to compute the
distance between every pair of records and identify the pair as a match
or a non-match. This takes time quadratic in the number of records,
which is prohibitive for large inputs. Some of the previous methods
generate comparison vectors and define classification. 9 Cluster-based
entity resolution that uses both relational and attribute information
has been shown to perform better than attribute-based record
linkage. 22 Linking several datasets using record linkage methods 23
and deduplication 24 to merge records and remove repetitions are
popular techniques. A wide range of studies on methods for record
linkage have been done. 25 The expectation-maximization (EM) algorithm
provides an improved decision rule in the Fellegi-Sunter model of
record linkage by employing probability estimation. 26 Traditional
probabilistic linkage models classify pairs of records as matches if
they agree on some of their common attributes, and non-matches
otherwise. 27 The probabilistic linkage system AutoMatch results in
better linkage quality than some deterministic ones, as shown in a
recent study. 8 Many other probabilistic methods also exist. 28-30
Identity uncertainty and citation matching problems have been solved by
the relational probability model. 31 Conditional models also cover the
problem of identity uncertainty. 32 Conditional random fields have been
used to segment and label data. 33 These are also applied in a
relational partitioning algorithm. 34 Multi-relational record linkage
allows propagation of matches. 35 Personal name matching
techniques, 36 distance calculation, 37 matching methods, 38 automated
correction of text techniques, 39 longest common substring, 40 and many
other techniques are also available for comparisons.

FEBRL is a well-known tool for the linkage of two datasets. 41 42
IntelliClean is another framework to identify duplicates by computing
the transitive closure under uncertainty and anomalies efficiently. 43
The multi-pass approach for the merge/purge problem considers alternate
key attributes and applies these results to compute the transitive
closure. 44 Many of these techniques use a blocking phase as a
preprocessing step where the records are hashed into buckets (or
blocks) based on some of the characters in the records, including
canopy clustering. 45 Unsupervised and unconstrained partition-based
clustering algorithms exist which are different from hierarchical
clustering methods. 46 We have significantly improved on the TPA (FCED)
algorithm, which is one of the fastest known record linkage algorithms.

Some parallel algorithms for hierarchical clustering have been
developed. 47-50 Parallel methods for record linkage also exist. 51-54
P-Swoosh uses match and merge processes, and also uses domain
knowledge. 51 An algorithm that performs better than P-Swoosh has been
reported. 52 This algorithm achieves an almost linear speedup, for
example 6.55-7.49 on eight processors. A different blocking technique
in initial data partitioning followed by a matching phase has also been
introduced. 53 54 Algorithms that we propose in this paper are based on
single linkage hierarchical clustering. Single linkage has been shown
to perform better, from a time complexity perspective, than complete
linkage and average linkage. 48 An analysis of different linkages in
hierarchical clustering has also been conducted. 55

Our approaches 

Naive algorithms for record linkage take $O(n^2 L^2)$ time, where n is
the number of records and L is the maximum length of any record. The
length of any record is nothing but the total aggregated length of all
the attributes employed in the record linkage analysis. When the data
size is very large, these algorithms take a very long time. Thus it was
an important open problem to devise faster algorithms. To make the
record linkage process faster and more reliable, we propose a very fast
sequential algorithm and a parallel algorithm.

Sequential algorithm 

The proposed algorithm is independent of the number of datasets. Thus,
we are able to integrate data from any number of datasets in an elegant
way. It is true that any algorithm that links two datasets can be
employed to integrate more than two datasets by invoking the algorithm
multiple times, each time integrating two. For example, if we have
three datasets A, B, and C, we can first merge A and B to get A' and
then merge A' and C. However, the output and accuracy of this approach
will depend on the order in which these pairwise merges are done. In
our sequential algorithm called RLA (record linkage algorithm), we
collect all the records from all the datasets and form a collection X;
we sort X after concatenating some or all of the common attributes
(first name, last name, gender, address, etc.) in each record. Using
this sorted list exact duplicates are eliminated. Two records are
treated as identical if they agree on the common attributes. Note that
in any record linkage algorithm record distances are calculated using
only these common attributes. Let X' be the set of records remaining
after the elimination of duplicates. Clustering is performed on X'. We
use blocking on X' based on l characters of the last names (for some
suitable value of l). Blocking may be done on last name, first name, or
any other relevant attribute. In our experiments on real datasets we
have found that the use of last names yields the best accuracy. Each
block consists of records that share an l-mer (ie, a substring of
length l) in the last names. An l-mer is also referred to as an l-gram
in the literature. Two records $r_1$ and $r_2$ will be in the same
block if they share at least one l-mer in their last names. Since a
record might share an l-mer with many other records, it could be in
many different blocks. If q is the maximum number of blocks that a
record is in and if n' is the number of records in X', then the
expected size of each block is $qn'/26^l$, assuming the English
alphabet. Single linkage and edit distance are used for the clusters
and records, respectively. Instead of constructing the entire
dendrogram, we utilize a threshold $\tau$ (an input parameter) to
generate a partial dendrogram that has only edges with distances no
more than $\tau$. Then a graph G(V, E) is generated in which V is X'.
Two nodes in V have an edge between them if and only if they are in the
same cluster of the partial dendrogram from some blocking. Thus, each
connected component of G contains the records pertaining to one
individual.

Algorithm 1: RLA 

1. Collect all the records from all the datasets and form a 
single list X. 

2. Sort the records in X and form groups such that each group 
consists of identical records. Pick one record from each such 
group and let X' be the resultant collection of records. 

3. Do blocking on X'. Specifically, there could be a block for
every possible l-mer. (Note that there are $26^l$ possible l-mers
when the alphabet corresponds to English.) Consider one
such l-mer y. If two records have y as an l-mer in their last
names then these two records will be in the block corresponding
to y. If there is an l-mer y' that does not occur in the last
name of any record, then the block corresponding to y' will
be empty. Also, the same record could be in many different
blocks. So a record is going to be in $(L - l + 1)$ blocks where L
is the length of this record and the blocking size is l.

4. Cluster every block obtained in step 3. Employ hierarchical
clustering with single linkage. Specifically, two records $r_1$
and $r_2$ will belong to the same cluster if the distance
between them is no more than the threshold $\tau$. We have employed
a fast algorithm for computing the edit distance between two
records. This algorithm, also used in Mi et al, 7 takes $O(\tau k)$
time where k is the minimum of the two record lengths and
$\tau$ is the specified threshold. 7

5. We generate a graph G(V, E) where V is the collection X'. 
Two records have an edge between them if there exists at 
least one cluster in at least one block in which both of these 
records belong. 

6. Find the connected components of G(V, E). 

7. Output each connected component as a cluster. While out- 
putting a connected component, also output records that are 
identical to records in the component. (Note that informa- 
tion about identical records is available from step 2). 
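
The seven steps of algorithm 1 can be sketched in Python as follows. This is an illustrative reading under several assumptions of ours, not the authors' implementation: records are (first name, last name) tuples, the record distance is the sum of the per-attribute edit distances, a dictionary plus the built-in sort stands in for radix sort, and a union-find structure replaces the explicit graph and connected-components pass (it yields the same components). The names `rla`, `l`, and `tau` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def edit_distance(a, b):
    """Textbook edit distance (the paper uses a faster thresholded variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def rla(records, l=2, tau=1):
    """Return clusters as sorted lists of indices into `records`."""
    # Steps 1-2: pool all records, group identical ones, keep one representative.
    copies = defaultdict(list)
    for idx, rec in enumerate(records):
        copies[rec].append(idx)
    distinct = sorted(copies)                       # this plays the role of X'

    # Step 3: one block per l-mer of the last name (short names join no block).
    blocks = defaultdict(set)
    for k, (_first, last) in enumerate(distinct):
        for s in range(len(last) - l + 1):
            blocks[last[s:s + l]].add(k)

    # Steps 4-6: link records within a block whose distance is at most tau.
    parent = list(range(len(distinct)))
    for members in blocks.values():
        for u, v in combinations(sorted(members), 2):
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:
                continue                            # already in one component
            (f1, l1), (f2, l2) = distinct[u], distinct[v]
            if edit_distance(f1, f2) + edit_distance(l1, l2) <= tau:
                parent[ru] = rv

    # Step 7: output each component together with the removed identical copies.
    clusters = defaultdict(list)
    for k, rec in enumerate(distinct):
        clusters[find(parent, k)].extend(copies[rec])
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    data = [("john", "smith"), ("john", "smith"), ("jon", "smith"),
            ("mary", "jones"), ("john", "smyth"), ("mary", "jones")]
    print(rla(data, l=2, tau=1))
```

Skipping pairs that are already connected does not change the resulting components, but it avoids redundant distance computations when a pair co-occurs in several blocks.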

Analysis 

The most time-consuming part of the proposed algorithm is the
calculation of linkages between records in blocks to generate the
graph G(V, E). Let b be the number of blocks in X', $b_a$ be the
average number of records in a block, L be the maximum length of a
record, n' be the number of records in X', and $\tau$ be the threshold
on the distance. The time complexity of algorithm 1 (steps 3-7) is
$O(b\,b_a^2 L \tau)$. In practice we have noted that $b\,b_a = O(n')$
and hence it takes $O(n' b_a L \tau)$ time for steps 3-7. Clearly, the
smaller the value of n' the better will be the run time. Steps 1 and 2
of algorithm 1 take time that is linear in the size of X. We refer to
the average number of (identical) duplicates we have for each record as
multiplicity. Another prominent idea we have applied is to reduce cache
misses. As the cache memory of each processor is limited and most of
the time it is not enough to hold all the records, cache misses occur
frequently. We handle this issue by copying frequently needed data into
a separate array so that these data will be in contiguous memory
locations. TPA (FCED) consumes a considerable amount of time in
removing duplication of linkages. We have cut this amount of time by
considering a graph-based solution where we find connected components
in linear time.
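
The threshold-dependent cost of the edit distance in step 4 of algorithm 1 can be illustrated with a banded dynamic program. The sketch below is one standard way to bound the work by the threshold; it is an assumption on our part that it resembles the faster computation used in TPA (FCED), whose details are not reproduced here. The name `bounded_edit_distance` and the parameter `tau` are hypothetical.

```python
def bounded_edit_distance(a: str, b: str, tau: int):
    """Return the edit distance if it is at most tau, otherwise None.

    Only cells with |i - j| <= tau are filled, so each of the k rows
    (k = length of the shorter string) costs O(tau) work."""
    if len(a) > len(b):
        a, b = b, a                          # make a the shorter string
    n, m = len(a), len(b)
    if m - n > tau:
        return None                          # length gap alone exceeds tau
    INF = tau + 1                            # stands for "more than tau"
    prev = [min(j, INF) for j in range(m + 1)]
    curr = [INF] * (m + 1)
    for i in range(1, n + 1):
        lo, hi = max(1, i - tau), min(m, i + tau)
        # cell just left of the band: column 0 if the band touches it, else INF
        curr[lo - 1] = min(i, INF) if lo == 1 else INF
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost, INF)
        if hi < m:
            curr[hi + 1] = INF               # cell just right of the band
        if min(curr[lo - 1:hi + 1]) >= INF:
            return None                      # every path already exceeds tau
        prev, curr = curr, prev
    return prev[m] if prev[m] <= tau else None
```

Because a pair of records is linked only when its distance is at most the threshold, returning None as soon as the bound is exceeded loses no information for clustering.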

Parallel algorithm 

We have parallelized the sequential algorithm (parallel record
linkage algorithm, or PRLA), which achieves nearly linear
speedups. We keep a copy of the input list X with each processor.
One of the processors is identified as the master and the
other processors are called slaves. Let p be the number of slaves.
The steps in the algorithm are enumerated below.

Algorithm 2: PRLA 

1. The master broadcasts all the input records to the slave 
processors. 

2. Each processor sorts a portion of X in parallel. Specifically,
the records of X are grouped based on the first two characters
of the last names. Note that there are $26^2$ possible
2-mers of characters and hence there are this many possible
groups (some of which could be empty). Each processor
sorts $26^2/p$ groups. As a by-product of this sorting, each
processor picks a representative from every group of identical
records that it sorted. In other words, we form X'. The
slaves inform the master about their findings.

3. The master assigns |X'|/p number of records from X' to 
each processor for the purpose of blocking. Each processor 
then performs blocking on its records and sends the blocks 
information to the master. 

4. The master aggregates the blocks. In particular, let y be
some possible l-mer. Parts of the block corresponding to y
could be with multiple processors. The master merges these
partial blocks.

5. Let $B_1, B_2, \ldots, B_t$ be the blocks in X'. Note that
$t \le 26^l$, where l is the blocking size. Let $n_i = |B_i|$, for
$1 \le i \le t$. The master sorts the values
$n_1^2, n_2^2, \ldots, n_t^2$ in descending order. Let
$s = \sum_{i=1}^{t} n_i^2$. The master then distributes the blocks
among the processors so that the work assigned to each processor is
nearly even. Specifically, the distribution is such that the sum
of squares of block sizes assigned to any processor is nearly
s/p.

6. The next task is to generate the graph G(V, E). To do this, 
each processor finds the edges in its blocks along the same 
lines as in the sequential algorithm. All of these edges from 
all the processors are sent to the master. 

7. The master finds the connected components in the graph. 
These connected components together with the initially 
removed copies of records yield us the clusters of interest. 

Analysis 

Let n be the number of records and n' be the number of distinct 
records in the input. Let L be the maximum length of any 
record in the input. 

In step 1, the broadcasting takes O(n) time. Grouping in step
2 can be done by sorting the records based on two characters
and hence this sorting step takes O(n) time as well. Once the
groups are formed (based on two characters), we can expect
each group to have $n/26^2$ records and hence the sorting of
groups takes an expected $O(n/p)$ steps. The communication of
the slaves with the master takes O(n) time.

In step 3, the master sends a subset of X' to each of the
slaves. This communication takes O(n') time. If l is the blocking
size, then each processor spends $O((n'/p)(L - l + 1))$ time in
forming the blocks. Note that there will be a total of $26^l$ blocks.
Each slave sends the master information about its blocks. In
particular, for every block it sends a list of indices of all the
records that belong to this block. As a result, the amount of
information sent from each slave to the master is
$O((n'/p)(L - l + 1))$. Therefore, the total communication time in
this step is $O(n'(L - l + 1))$.

In step 4, aggregation of the blocks received from all the
slaves in step 3 is done in $O(n'(L - l + 1))$ time by the master.
Then a sorting is done on the list of sizes of the blocks. This
takes $O(26^l)$ time using radix sort.

In step 5, the blocks are distributed among the slaves such
that the value of s is nearly balanced across the slaves. Note that
this problem is NP-complete. We use the sum of squares of
block sizes to compute s for the following reason. To compute
the edges within each block, in the worst case, each record is
compared with every other record. As a result, the worst case
time spent on each block is proportional to the square of the
block size. We have tried several ways of distributing the blocks.
In each of these ways, a block might get split between two adjacent
processors to ensure a close partitioning. None of the techniques
we have employed guarantees an exactly even partitioning (or an
optimal partitioning). One simple partitioning we have used is to
use the sorted list $Q = n_1^2, n_2^2, \ldots, n_t^2$. We identify a
minimum prefix of this sequence whose sum equals or exceeds s/p. If
this prefix sum equals s/p, then this prefix sequence of blocks will
be assigned to the first processor. If this prefix sum exceeds s/p,
then the last block in this prefix sequence will be split between the
first and the second processors. The splitting will be done to ensure
that the work assigned to the first processor is as close to s/p as
possible. By the work assigned to a processor we mean the sum of
squares of the sizes of the blocks assigned to the processor. In the
case of the prefix sum exceeding s/p, a portion of the last block in
this prefix sequence will be assigned to the second processor. The
second processor will also be assigned the next number of
blocks in the sorted sequence Q. This number of blocks will be
such that the work assigned to this processor is nearly s/p, and
so on. The time taken by the master in step 5 is O(t) where t is
the number of blocks. If the blocking size is l, then $t \le 26^l$.
After this, the master creates a list of records for each slave to
work on. This takes $O(n'(L - l + 1))$ time. Subsequently, the master
sends the individual lists to the slaves. This communication also
takes $O(n'(L - l + 1))$ time.
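
The prefix-based partitioning described above can be sketched as follows. The sketch treats the work of a block (the square of its size) as freely divisible, which is an assumption on our part: the text does not specify how the comparisons of a split block are divided. The function name `distribute_blocks` and the output format are hypothetical.

```python
def distribute_blocks(block_sizes, p):
    """Assign work to p processors so each gets close to s/p, where
    s is the sum of squared block sizes.

    Returns p lists of (block_index, work_share) pairs; a block that
    straddles a boundary appears in the lists of adjacent processors."""
    order = sorted(range(len(block_sizes)),
                   key=lambda i: block_sizes[i], reverse=True)
    s = sum(n * n for n in block_sizes)
    target = s / p
    assignment = [[] for _ in range(p)]
    proc, room = 0, target                  # current processor and its free capacity
    for i in order:
        work = float(block_sizes[i] ** 2)
        while work > 1e-9:
            # the last processor absorbs whatever remains (rounding included)
            take = work if proc == p - 1 else min(work, room)
            assignment[proc].append((i, take))
            work -= take
            room -= take
            if room <= 1e-9 and proc < p - 1:
                proc += 1                   # this processor is full; move on
                room = target
    return assignment


if __name__ == "__main__":
    for rank, share in enumerate(distribute_blocks([4, 3, 2, 1], 2)):
        print(rank, share)
```

With block sizes 4, 3, 2, 1 and two processors, s = 30 and the target is 15: the first processor receives 15 of the 16 units of the largest block, and the second receives the remaining unit plus the three smaller blocks.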

In step 6, each processor works on its blocks. The time spent
in this step is $O(s/p)$. Note that the expected size of each block
is $n'(L - l + 1)/26^l$. Also, the time spent in computing the
distance between any two records is $O(\tau L)$, where $\tau$ is the
distance threshold. Thus the expected value of s is
$((n')^2 (L - l + 1)^2 / 26^l)\,\tau L$. Our empirical results
indicate that the total number of edges generated across all the
processors is $O(n'(L - l + 1))$. In this case, the communication
time is O(n'). As a result, the connected components in step 7
can also be found in $O(n'(L - l + 1))$ time.

In summary, the total expected run time of the algorithm is
$O\big(n + n'(L - l + 1) + ((n')^2 (L - l + 1)^2 / (p \times 26^l))\,\tau L\big)$,
where $\tau$ is the distance threshold. It turns
out that the last term is the dominating one among the three
terms in this time complexity. Table 4 explains why we get a
speedup that is close to linear. Please note that blocking is
quite useful in reducing the run time. For example, even if
L=15, for a value of l=3, the value of
$(n')^2 (L - l + 1)^2 / 26^l$ is $0.0096\,(n')^2$.
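
The constant quoted in this example can be checked directly: with L=15 and a blocking size of 3, the factor is (15 - 3 + 1)^2 / 26^3 = 169/17576. The helper name `pair_fraction` below is ours.

```python
def pair_fraction(L: int, l: int, alphabet: int = 26) -> float:
    """Factor multiplying (n')^2 in the expected number of compared pairs."""
    return (L - l + 1) ** 2 / alphabet ** l


if __name__ == "__main__":
    print(round(pair_fraction(15, 3), 4))  # 169 / 17576, prints 0.0096
```

In other words, under the uniform-alphabet assumption of the analysis, blocking with 3-mers leaves roughly 1% of the pairwise comparisons that a block-free scan of X' would perform.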

Also, the run times of most of the (sequential and parallel)
algorithms found in the literature depend on $n^2$. Thus the work
done by our algorithm is expected to be significantly better than
competing algorithms since our run time depends on $(n')^2$. In
practice the value of $(n')^2$ is much smaller than that of $n^2$.
Although parallel algorithms exist (see Greiner, 56 for example)
for finding connected components, we have not used them here
since the time needed for this step is very small.

RESULTS 

We have implemented our sequential version for simulated data
in C++ to make a better comparison with the parallel version,
as PRLA has been implemented using MPI with C++. We have
also used a C++ implementation of the TPA (FCED) algorithm
to compare with our sequential version. As TPA (FCED) was
originally implemented in Java, we have also implemented our
algorithm in Java to make a fair comparison with the results in
Mi et al. Our sequential algorithm outperforms TPA (FCED), 7
especially when the multiplicity is large.

We have tested our algorithms on both synthetic and real
data. We have collected real datasets from the Connecticut
Health Information Network (CHIN). As TPA (FCED) 7 ensures
very high accuracy of record linkage but consumes a large
amount of time, our main purpose was to provide a much faster
solution. So we have developed our algorithms in such a way
that the accuracy remains the same, but the algorithms run
much faster. In the blocking phase, we have used 4-mers for all
the experiments. The blocking size l has to be chosen carefully.
If l is low, the accuracy will be high. A higher
value will result in a reduction in the run time but the accuracy
might suffer.

Results on simulated data for the sequential algorithm 

The implementation has been deployed in the HORNET cluster 
housed in the Booth Engineering Center for Advanced 
Technology (BECAT), University of Connecticut. This cluster 
has 64 nodes, each of which has 12 Intel Xeon X5650 
Westmere cores, 48 GB of RAM, and 500 GB of local storage. 

The running time of our algorithms is independent of the
number of datasets as we add all the records to a single list and
work with only this list. As in TPA (FCED), 7 we have employed
both constant and proportional threshold values in the clustering
step. Our algorithm has been tested for each type of distance
calculation. The total number of records used for this test
ranges from 50 000 to 5 000 000 to reveal the power of our
algorithm. Five records have been generated for each individual,
in which four are error free. So, on the five records of any
individual, exact clustering will find two clusters.

To compare with TPA (FCED), we employ edit distances of
two attributes, namely the first name and the last name. TPA
(FCED) spends around 650.49 s on 1 000 000 records whereas
our algorithm takes only 92.99 s, which is seven times faster for
this amount of data. Table 1 summarizes the comparison.
Figure 1 provides a graphical representation of this comparison.
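
The speedup of RLA over TPA (FCED) can be recomputed from the run times reported in table 1. The short script below only restates those published numbers; the dictionary layout and variable names are ours, and rows for which TPA (FCED) did not terminate are omitted.

```python
# Run times in seconds from table 1: records -> (TPA (FCED), RLA)
table1 = {
    50_000: (7.35, 1.19),
    100_000: (24.81, 3.67),
    200_000: (71.47, 10.25),
    400_000: (178.74, 26.09),
    600_000: (324.82, 45.99),
    800_000: (489.43, 68.67),
    1_000_000: (650.49, 92.99),
    2_000_000: (1844.52, 256.51),
}

# Speedup = time of the baseline divided by time of RLA
speedups = {n: tpa / rla for n, (tpa, rla) in table1.items()}

if __name__ == "__main__":
    for n, sp in speedups.items():
        print(f"{n:>9,d} records: {sp:.2f}x")
```

On these inputs the ratio stays between roughly 6 and 7.2 across all sizes for which both run times are available.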

When the input data contains a large number of records, TPA 
(FCED) spends too much time to complete. Table 2 displays the 
time taken by RLA on various steps. 

When we have 1 000 000 records, finding clusters using exact
matching (steps 3-6 in the sequential algorithm) takes only
8.8 s. The size of X', after removing duplicates, is only 387 707.
From table 1, we see that TPA (FCED) takes around 178.74 s to
find clusters for 400 000 records. But RLA clusters 387 707
records by approximate clustering within 83.21 s. This improvement
is because of the graph-based solution and the avoidance
of cache misses. So clustering of 1 000 000 records takes only
92.99 s. Even when the multiplicity is 1, our algorithm runs
around two times faster than TPA (FCED). Since in practice the
multiplicity of data is more than 1, our algorithms run much
faster as shown in figure 1. Our proposed algorithm is more
than 20 times faster than the previous algorithm TPA (FCED)
on the datasets of records having a multiplicity of 5. Figure 2 is
a graphical representation of table 2.

Table 1 Comparison of results on simulated data

Number of records    Run time in seconds
                     TPA (FCED)    RLA
50 000               7.35          1.19
100 000              24.81         3.67
200 000              71.47         10.25
400 000              178.74        26.09
600 000              324.82        45.99
800 000              489.43        68.67
1 000 000            650.49        92.99
2 000 000            1844.52       256.51
3 000 000            -             490.54
4 000 000            -             800.02
5 000 000            -             1123.85

-, the algorithm took too long to terminate.

Table 2 Analysis of results on simulated data (RLA)

Number of    Number of exact    Number of    Exact cluster    Approx cluster    Merge    Total
records      clusters           clusters     time             time              time     time
50 000       19 582             12 130       0.21             0.95              0.03     1.19
100 000      39 201             23 965       0.48             3.11              0.08     3.67
200 000      78 453             46 487       1.11             8.97              0.17     10.25
400 000      156 934            88 725       2.71             23.04             0.34     26.09
600 000      232 866            130 746      4.67             ...

(The remaining entries of table 2 are truncated; the only other surviving values are 5.44 and 1041.28.)

A similar experiment, which uses reversal edit distance, also
shows superiority of the RLA algorithm. Reversal edit distance
takes in two groups of attributes, calculates edit distance in both
original direction and reversal direction, and returns the smaller
one. In our experiments, we aggregate the edit distance of the
first attributes of the two records and the edit distance of the
second attributes of them. Again we add the edit distance
between the first attribute of the first record and the second
attribute of the second record and the edit distance between the
second attribute of the first record and the first attribute of the
second record. We then take the smaller of these two distances,
as this is the reversal distance value. Figure 3 shows almost the
same efficiency for RLA on this distance as well.
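
The reversal distance just described can be written down in a few lines. The sketch assumes each record is a pair of attributes such as (first name, last name) and uses a textbook edit distance; the function names are ours.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook edit distance used as the per-attribute distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def reversal_distance(r1, r2, dist=levenshtein):
    """Smaller of the attribute-wise distance and the swapped-attribute distance."""
    (a1, b1), (a2, b2) = r1, r2
    straight = dist(a1, a2) + dist(b1, b2)   # first vs first, second vs second
    swapped = dist(a1, b2) + dist(b1, a2)    # first vs second, second vs first
    return min(straight, swapped)


if __name__ == "__main__":
    print(reversal_distance(("john", "smith"), ("smith", "john")))  # prints 0
```

A record whose first and last names were entered in the wrong fields therefore has distance 0 to its correctly entered counterpart, at the cost of computing four edit distances instead of two.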

In this case, however, both algorithms take more time than for
the previous distance calculation, because the definition of the
reversal distance requires two aggregate edit distances to be
calculated for every pair of records.

We have performed another experiment using edit distance as
the distance method but adding a parameter, namely truncation
count. We have used a truncation count of 2, which means that
we only employ the first two characters of any attribute concerned.
Both the algorithms produce more clusters in this case.
The process is slow since more linkages will have to be dealt
with. Figure 4 shows the comparison.

Figure 1 Results on synthetic data (y axis denotes time in seconds; x axis corresponds to number of records in thousands). RLA, record linkage
algorithm; TPA (FCED), two-phase algorithm using faster computation of the edit distance.

Figure 2 Analysis of results on synthetic data using the record linkage
algorithm (y axis denotes time in seconds; x axis corresponds to number
of records in thousands).

In the above cases, we have used a constant threshold to find
clusters. The next test shows results for using a proportional
threshold, which is dependent on the length of the considered
attributes. Results are shown in figure 5.

Proportional threshold sometimes works better as it is 
dependent on the data. We omit details on the proportional 
threshold, as the procedure is similar to the constant threshold. 

Clearly, the threshold has a great impact on the accuracy of
the clusters, as a threshold that is too small or too large will
normally yield a high error rate. That is why a training phase is
needed to learn the threshold.
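The difference between the two kinds of threshold can be sketched as below. The paper omits the details of the proportional threshold, so this is an assumption-laden illustration: the function names are hypothetical, two records are assumed to be linked when their distance does not exceed the threshold, and the proportional threshold is assumed to be t times the longer combined attribute length. The values 1 and 0.1 are the constant and proportional thresholds used in the experiments.

```python
def within_constant_threshold(distance, threshold=1):
    """Link two records when their distance does not exceed a fixed value."""
    return distance <= threshold


def within_proportional_threshold(distance, attrs_a, attrs_b, t=0.1):
    """Link two records when the distance is small relative to attribute length.

    Assumption: the allowed distance is t times the longer of the two
    combined attribute lengths; the paper does not spell this out.
    """
    combined_length = max(sum(len(a) for a in attrs_a),
                          sum(len(b) for b in attrs_b))
    return distance <= t * combined_length
```

Under these assumptions, a single-character difference is accepted for long names such as ("jonathan", "livingstone") but rejected for short ones such as ("al", "li"), whereas a constant threshold of 1 accepts both.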



Results on real data for the sequential algorithm 

Our experiments on real data have been conducted on the CHIN
server for security reasons. The machine has an Intel(R)
Xeon(R) X5460 CPU running at 3.16 GHz and 4 GB of RAM. The data
come from four different datasets with a total of 1 083 878 records.

Table 3 shows the comparison. RLA employs two attributes,
namely the first name and the last name. Within 15 s, it outputs
112 404 exact clusters. The rest of the steps take around 19 s.
The algorithm terminates within 34.5 s whereas TPA (FCED)
spends around 2961 s. RLA is thus about 85 times faster than
TPA (FCED) on these real data. The accuracy is 93.0% for both.
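The accuracy and completeness figures in table 3 can be checked with a short calculation. The definitions below are an assumption inferred from the table columns (accuracy as correct clusters over created clusters, completeness as correct clusters over the number of individuals). They reproduce the two-attribute row after rounding and come within about 0.1 percentage point of the three-attribute row.

```python
def accuracy_pct(correct_clusters, created_clusters):
    """Assumed definition: share of created clusters that are correct."""
    return 100.0 * correct_clusters / created_clusters


def completeness_pct(correct_clusters, individuals):
    """Assumed definition: share of individuals recovered as a correct cluster."""
    return 100.0 * correct_clusters / individuals


# Two-attribute row of table 3.
two_attr_accuracy = accuracy_pct(87_756, 94_381)
two_attr_completeness = completeness_pct(87_756, 108_800)
```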

Figure 3 Comparison on reversal edit distance (y axis denotes time in seconds; x axis corresponds to number of records in thousands). RLA, record linkage algorithm; TPA (FCED), two-phase algorithm using faster computation of the edit distance.

Figure 4 Comparison on truncation edit distance (y axis denotes time in seconds; x axis corresponds to number of records in thousands). RLA, record linkage algorithm; TPA (FCED), two-phase algorithm using faster computation of the edit distance.



We have also used the date of birth attribute in addition to
the above two attributes. The running time remains impressive:
RLA takes only 48.7 s whereas TPA (FCED) takes 3402 s. In
this case, RLA is about 70 times faster. An accuracy of 97.8% is
achieved for these data, since using a larger number of
attributes removes many false positives.



Results on simulated data for the parallel algorithm 

In our experiments, we have used at most 32 cores from four 
nodes, eight from each node. In this case, we have used another 
set of synthetic data, in which the multiplicity is nearly 1. 

An algorithm is fully parallel when the speedup is linear. We
have optimized our algorithm to make its speedup almost linear.

Figure 5 Results on synthetic data using proportional threshold (t=0.1; y axis denotes time in seconds; x axis corresponds to number of records in thousands). RLA, record linkage algorithm.

Table 3 Results on real datasets (1 083 878 records)

Number of    Algorithm    Time (s)  Created   Correct   Number of    Accuracy  Completeness
attributes                          clusters  clusters  individuals  (%)       (%)
2            TPA (FCED)   2961      94 381    87 756    108 800      93.0      80.7
             RLA          34.5
3            TPA (FCED)   3402      101 864   99 562    108 800      97.8      91.6
             RLA          48.7

Table 4 analyzes the running time of PRLA for 6 million records.
The first column shows the number of cores used. The total time
spent in broadcast operations that take place in steps 1, 3, and 5
is shown as bcast. The total time for the other communications
that happen in steps 2, 3, 4, and 6 is shown as comm. As we
can readily see, these communication overheads are very low.
The master performs certain tasks on its own in steps 3, 4, 5,
and 7. This total time is displayed as master in table 4. The time
for sorting and finding duplicates in step 2 is dedup. The total
time for blocking (block, in step 3), merging (merge, in step 4),
distribution of blocks (dist, in step 5), and finding connected
components (concomp, in step 7) is very low as well. Generating
edge lists is the major time-consuming step. This time is shown
as edgelist. The fact that this step dominates the entire run time
is also revealed in our time complexity analysis above. The first
row, seq, shows the runtime consumed by sequential RLA.
Figure 6 graphically describes the data in table 4.
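The relationship between the edge list and the connected components step can be pictured with a small sketch. It is illustrative only: the function name is hypothetical, records are assumed to be numbered from 0, and union-find is one common way to compute connected components, not necessarily the method used in PRLA.

```python
def connected_components(num_records, edges):
    """Group record indices into clusters given (record, record) similarity edges."""
    parent = list(range(num_records))

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in range(num_records):
        clusters.setdefault(find(record), []).append(record)
    return sorted(clusters.values())
```

For example, six records with edges (0, 1), (1, 2), and (4, 5) fall into three clusters: records 0 to 2 together, record 3 alone, and records 4 and 5 together.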

The time results are also shown in figure 7. The x-axis repre- 
sents the number of cores used and the y-axis shows time in 
seconds. 

Our results show that the speedup is around 7.5 for eight cores
(residing in a single node), 14.1 for 16 cores (residing in two
nodes), and 26.4 for 32 cores (residing in four nodes). These
values show that the speedup is almost linear (figure 8). We have
tested on 1, 2, 4, 8, 16, and 32 cores.
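One way to quantify how close these speedups are to linear is parallel efficiency, that is, speedup divided by the number of cores, where a value of 1 corresponds to perfectly linear speedup. The short sketch below applies this standard definition to the reported figures (the function and variable names are ours).

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal (linear) speedup that is actually achieved."""
    return speedup / cores


# Speedups reported in the text, keyed by number of cores.
reported_speedups = {8: 7.5, 16: 14.1, 32: 26.4}

efficiencies = {cores: parallel_efficiency(speedup, cores)
                for cores, speedup in reported_speedups.items()}
```

The resulting efficiencies are roughly 94% on 8 cores, 88% on 16 cores, and 82.5% on 32 cores.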

DISCUSSION 

Our algorithms ensure the same accuracy as the previous
algorithm, TPA (FCED). Accuracy and completeness have been
calculated on the real datasets. A Social Security Number (SSN)
or DDS identification number was available for these records, and
we utilized it for calculating the accuracy. These numbers were
revealed to us only after our algorithms produced the results.

Table 4 Distribution of running time for 6 000 000 records

Pr Total time

Figure 6 Analysis of results on synthetic data using the parallel record linkage algorithm (y axis denotes time in seconds; x axis corresponds to number of processors).

Figure 7 Results on simulated data (for 6 million and 9 million records; y axis denotes time in seconds; x axis corresponds to number of processors).

To cluster records more accurately, an appropriate threshold
value is necessary. Such a threshold can be obtained through a
learning process, as described in Mi et al.7 The idea is to have a
training phase that uses records for which the right clustering
is known. The whole procedure is described in detail in
Mi et al.7 We have used a constant threshold value of
1 and a proportional threshold value of 0.1.



Besides using edit distance, we have also employed reversal 
edit distance and truncation distance. A common error occur- 
ring in records is the reversal of the first and last names. In 
these cases, reversal edit distance will yield better results. 
Truncation distance is used when a specific portion of records is 
sufficient for determining the clusters. All of these distance cal- 
culations make our algorithms versatile. 

We experimented on four real datasets with a total size of
1 083 878 records. Two datasets came from the University of
Connecticut's Dental Clinic (UCHC) and two from the Connecticut
Department of Developmental Services (DDS).

Figure 8 Speedup (for 6 million and 9 million records; y axis denotes speedup; x axis corresponds to number of processors).

To generate simulated data, we collected 200 000 records of
deceased people from ssdmf.info. Each record has SSN, last name,
first name, middle name, date of death, and date of birth
attributes. We then introduced 2-3 new characters in the first
name or last name for 90% of the records. For the others, we
altered 1-3 characters of the first name or last name. We have
thus generated 1 000 000 records. Then we replicated the file
three times. We also generated another eight datasets of
1 000 000 records, introducing errors using the above
procedure.
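The error-injection procedure can be sketched as follows. This is a hedged illustration rather than the generator actually used: the function names, the dictionary field names `first_name` and `last_name`, the uppercase alphabet, and the random choice of which name to modify are all assumptions.

```python
import random
import string


def insert_characters(name, count, rng):
    """Insert `count` random letters at random positions in `name`."""
    chars = list(name)
    for _ in range(count):
        position = rng.randint(0, len(chars))  # inclusive, so appending is possible
        chars.insert(position, rng.choice(string.ascii_uppercase))
    return "".join(chars)


def alter_characters(name, count, rng):
    """Replace up to `count` distinct positions of `name` with different letters."""
    chars = list(name)
    positions = rng.sample(range(len(chars)), min(count, len(chars)))
    for position in positions:
        candidates = [c for c in string.ascii_uppercase if c != chars[position]]
        chars[position] = rng.choice(candidates)
    return "".join(chars)


def perturb_record(record, rng, insert_fraction=0.9):
    """Return a noisy copy of `record` with an error in the first or last name."""
    noisy = dict(record)
    field = rng.choice(["first_name", "last_name"])  # hypothetical field names
    if rng.random() < insert_fraction:
        # About 90% of records: introduce 2-3 new characters.
        noisy[field] = insert_characters(noisy[field], rng.randint(2, 3), rng)
    else:
        # Remaining records: alter 1-3 existing characters.
        noisy[field] = alter_characters(noisy[field], rng.randint(1, 3), rng)
    return noisy
```

Passing a seeded `random.Random` instance as `rng` makes the generated errors reproducible across runs.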

CONCLUSIONS 

To integrate a huge number of records across multiple datasets, 
especially from diverse medical and health datasets, our algo- 
rithms ensure very fast solutions with high accuracy. 

The overall runtime of our algorithms depends on the
multiplicity. Even for a multiplicity of 1, our algorithm is faster
than TPA (FCED) by a factor of 2. For larger multiplicities, our
algorithm achieves impressive speedups over TPA (FCED). For
instance, if the multiplicity is 10, then the speedup is more than
100. Runtime and accuracy of our algorithms also depend on
the value of l used for blocking. In general, some learning
techniques should be applied to figure out a good threshold value.
The parallel algorithm achieves a nearly linear speedup.